In addition to running the code, you will need to do two things: either write in complete sentences a response or write R code to fill in a chunk.
“xxxxx” = Write, in complete English sentences, your response. 1-2 sentences is sufficient.
“Plot” = Write R code in the chunk corresponding to the instructions.
For this problem set, we’ll examine a data set of stops by the Charlotte-Mecklenburg Police Department (CMPD).
Our focus will be to understand what factors are related to whether someone is searched or not for a traffic stop.
For our data set, you’ll load the csv file we saved in the setup. This should be in your data folder.
library(tidyverse)
library(scales)
library(ggspatial) # make sure to install if you don't have it
df <- read_csv("data/Officer_Traffic_Stops.csv")
First, look at the data using the glimpse() function from dplyr
glimpse(df)
## Observations: 68,488
## Variables: 17
## $ Month_of_Stop <chr> "2017/06", "2017/06", "2017/06", "2017/…
## $ Reason_for_Stop <chr> "Vehicle Regulatory", "Vehicle Regulato…
## $ Officer_Race <chr> "White", "White", "White", "White", "Wh…
## $ Officer_Gender <chr> "Female", "Male", "Male", "Male", "Male…
## $ Officer_Years_of_Service <dbl> 3, 2, 5, 10, 2, 3, 2, 5, 25, 10, 25, 12…
## $ Driver_Race <chr> "Black", "Black", "Black", "White", "Bl…
## $ Driver_Ethnicity <chr> "Non-Hispanic", "Non-Hispanic", "Non-Hi…
## $ Driver_Gender <chr> "Male", "Male", "Male", "Male", "Female…
## $ Driver_Age <dbl> 23, 36, 45, 30, 37, 50, 50, 19, 41, 22,…
## $ Was_a_Search_Conducted <chr> "No", "No", "No", "No", "No", "No", "No…
## $ Result_of_Stop <chr> "Verbal Warning", "Verbal Warning", "Ve…
## $ CMPD_Division <chr> "North Tryon Division", "Central Divisi…
## $ ObjectID <dbl> 3001, 3002, 3003, 3004, 3005, 3006, 300…
## $ CreationDate <dttm> 2019-02-16 10:01:44, 2019-02-16 10:01:…
## $ Creator <chr> "CharlotteNC", "CharlotteNC", "Charlott…
## $ EditDate <dttm> 2019-02-16 10:01:44, 2019-02-16 10:01:…
## $ Editor <chr> "CharlotteNC", "CharlotteNC", "Charlott…
Notice the different variable types: character (chr), num (numeric), and datetime (POSIXct).
Let’s consider our target variable: Was_a_Search_Conducted.
Plot a bar chart that counts the number of records by Was_a_Search_Conducted.
ggplot(df, aes(x=Was_a_Search_Conducted))+geom_bar()
How well balanced is the data set by this field? The data set heavily unbalanced by this field in the case of a sarch not being conducted.
Next, let’s consider the age range of the driver.
Plot a histogram of Driver_Age. Determine an appropriate number of bins.
ggplot(df, aes(x=Driver_Age))+geom_histogram(binwidth=5)
Once you go above (around) 40-50 bins, you’ll notice some points stick out.
What is happening? xxxxx
Plot a density plot of Driver_Age. Add in a fill to be “lightblue”. Determine an appropriate kernel density to use (adjust).
ggplot(df, aes(x=Driver_Age))+geom_density(fill='lightblue', adjust=3)
Plot a box plot with Was_a_Search_Conducted on the x-axis and Driver_Age on the y-axis.
ggplot(df, aes(x=Was_a_Search_Conducted, y=Driver_Age))+geom_boxplot()
Plot a violin plot.
ggplot(df, aes(x=Was_a_Search_Conducted, y=Driver_Age))+geom_violin()
From the plots above, do you think the age of the driver is a significant factor in whether a search was conducted? Why or why not?
From the plots above I absolutely think age of the driver plays a significant factor in wheather a search is conducted. Looking at the boxplot you can see that the average age, quartiles, and outlers of those pulled over and searched is much lower than that of those that weren’t searched. In the violin plot you can see that the distribution of ages is much different in those that were searched compared to those that weren’t. You can also see that this distrubution has much more younger drivers in the search conducted category.
Let’s plot the number of stops by time.
Recalling part one, the Month_of_Stop variable is a character, not a date variable. The datatime’s are simply when the data was collected; not when the stop occurred. Therefore, we’ll need to convert the Month_of_Stop variable from a character to a Date format.
Let’s first cleanup the date field using tidyverse packages like stringr and lubridate.
library(stringr); library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
# see https://dsba5122fall2019.slack.com/archives/CLUCHHQPJ/p1569273552006200
df <- mutate(df, Month_of_Stop = str_replace_all(Month_of_Stop, "/","-")) # replace "/" with "-"
df <- mutate(df, Month_of_Stop = paste0(df$Month_of_Stop,"-01")) # add in day
df <- mutate(df, Date = ymd(Month_of_Stop)) # created a date field
Plot a line chart with the number of traffic stops for each month (hint: start with the count() function by Date then feed into ggplot. Remember the count variable is named ‘n’.).
D <- df %>%
count(Date)
ggplot(D, aes(x=Date, y=n))+geom_line()
What is the trend (i.e., long term rate of change) of the number of traffic stops in Charlotte?
The trend would appear to be that the amount of traffic stops tends to spike early in the year and then drop off as the year goes on.
Plot the same plot but add in facet_wrap() by the Reason_for_Stop variable.
D2 <- df %>%
group_by(Reason_for_Stop) %>%
count(Date)
ggplot(D2, aes(x=Date, y=n))+geom_line()+facet_wrap(~Reason_for_Stop)
What is a problem with this plot?
A problem with this plot is that it is difficult to make interpretations on the y-axis since each graph is on the same scale.
To address this problem, you will need to figure out how to adjust the scale. To do this, you need to use R’s documentation to see whether there is a parameter in facet_wrap.
Go to your RStudio console and type ?facet_wrap.
What parameter allows you to modify the scales of facet_wrap? scales
Plot the same plot but with a free y-axis scale.
D2 <- df %>%
group_by(Reason_for_Stop) %>%
count(Date)
ggplot(D2, aes(x=Date, y=n))+geom_line()+facet_wrap(~Reason_for_Stop, scales='free_y')
Which type of police stop has had the most volatility (i.e., big swings in number of stops)? Speeding has the biggest and most number of swings.
What is one problem with allowing the y-axis be free? It is difficult to make comparisons between graphs since they are on different scales.
Small multiples tends to be less effective when each of the variables are on different scales or magnitudes.
Let’s consider instead CMPD traffic stops but by CMPD division. These are more even spread by division than the type of stop.
Plot a line chart (optional points too) for stops by Date (x axis) and counts (‘n’, or whatever you named your count variable) (y axis). (hint: to modify how the date is shown, use the layer scale_x_date(date_labels = "%Y") + to show only the year. Feel free to modify by looking at ?scale_x_date.)
D3 <- df %>%
group_by(CMPD_Division) %>%
count(Date)
ggplot(D3, aes(x=Date, y=n))+geom_line()+facet_wrap(~CMPD_Division)+scale_x_date(date_labels = "%Y")
What are three observations you can make about the number of police stops by divison? (hint: just write about what’s in the data.)
North Tryon Division, North Division, and Steele Creel Division all have big spikes around 2017.
There is a lot of variance between number of police stops amongst most of the divisions.
A lot of the divisions number of stops seems to be around 300.
Next, this doesn’t help tell us where these areas are. For that, let’s use a shape file to create a chloropleth of stops by division.
For this example, we’ll create a cholorpleth for the number of police stops by police division.
To do this, we need to use the sf package. (For help along the way, see this tutorial on sf package.)
library(sf); library(viridis)
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
## Loading required package: viridisLite
##
## Attaching package: 'viridis'
## The following object is masked from 'package:scales':
##
## viridis_pal
cmpd <- st_read("./data/CMPD_Police_Divisions/CMPD_Police_Divisions.shp")
## Reading layer `CMPD_Police_Divisions' from data source `/Users/jackphillips/Documents/problem_set-2-phillips/data/CMPD_Police_Divisions/CMPD_Police_Divisions.shp' using driver `ESRI Shapefile'
## Simple feature collection with 15 features and 12 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -81.05795 ymin: 35.00218 xmax: -80.55006 ymax: 35.52533
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
Note that while we have five files, we only load in the shapefile (.shp) file. This is typical but know that to use this file you would need the other four files in the same folder as your shapefile.
Plot cmpd using the geom_sf package where you provide fill = DNAME as the only aesthetic. Add in a title saying “CMPD Divisions” and add the theme_bw() theme to make translate the file into the black and white template.
ggplot(cmpd) + geom_sf(aes(fill =DNAME))+theme_bw() + ggtitle("CMPD Divisions")
One problem with this map is it’s hard to read the division names. That is, it may be better to remove the legend and put the labels of each division within the plot.
To do this, we can use the related geom_sf_label() geom, using the name of the division as the aesthetic label.
Plot the same plot from above but with the name of the division as the label.
cmpd %>%
mutate(Name = as.character(DNAME)) %>%
mutate(Name = str_replace_all(Name, " Division","")) %>%
ggplot() + geom_sf(aes(fill=Name), show.legend = FALSE)+ geom_sf_label(aes(label = Name), size=1.5)+theme_bw() + ggtitle("CMPD Divisions")
## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may
## not give correct results for longitude/latitude data
You’ll likely need to reduce the size of the label, using the size paramater. You should likely set the size to under 2.
Make sure to remove the legend (it’s redundant and no longer necessary).
Create a new variable named Name that removes the term " Division". This term is redundant and takes up a lot of space in the labels from DNAME. To do this step, use this snippet of code at the top of your pipeline:
cmpd %>%
mutate(Name = as.character(DNAME)) %>%
mutate(Name = str_replace_all(Name, " Division",""))
g. Make sure to call it once so that the map will output.g <- cmpd %>%
mutate(Name = as.character(DNAME)) %>%
mutate(Name = str_replace_all(Name, " Division",""))
g
## Simple feature collection with 15 features and 13 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -81.05795 ymin: 35.00218 xmax: -80.55006 ymax: 35.52533
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
## First 10 features:
## OBJECTID DIVISION DNAME
## 1 1 01 Central Division
## 2 2 02 Metro Division
## 3 3 06 Eastway Division
## 4 4 07 North Tryon Division
## 5 5 11 North Division
## 6 6 12 Hickory Grove Division
## 7 7 14 University City Division
## 8 8 16 Providence Division
## 9 9 17 Independence Division
## 10 10 21 Steele Creek Division
## WNAME GlobalID
## 1 Eastway Wrecker Service 58a97461-186e-4abb-a400-2ea179123eff
## 2 Hunter Auto and Wrecker Service 0983dd29-56d2-4c46-8526-a80e36e817d8
## 3 Hunter Auto and Wrecker Service a5163030-34cc-458f-9a78-47ec9c31424f
## 4 Hunter Auto and Wrecker Service 8631db25-1881-496f-b00c-e41808ab8860
## 5 Hunter Auto and Wrecker Service a9fd62c7-2c07-4172-8f9b-0524332674cf
## 6 Southern Star of Charlotte, Inc 4d4edf87-a411-4dc5-ad91-454455d09b68
## 7 Larry Campbell's Towing & Recovery 536f35e0-3f29-4441-bfca-a7a1db729ac4
## 8 Williams Wrecker e5075dea-f71b-4da0-9125-81ecb98254c3
## 9 Williams Wrecker 8632dd32-2a72-40fd-9997-89803e572602
## 10 Dellinger Wrecker Service 9a701587-537a-437b-ad24-f57bb5539e50
## DIVISION_C
## 1 http://charlottenc.gov/CMPD/ResponseAreas/Pages/CENTRAL.aspx\r\n
## 2 http://charlottenc.gov/CMPD/ResponseAreas/Pages/METRO.aspx
## 3 http://charlottenc.gov/CMPD/ResponseAreas/Pages/EASTWAY.aspx
## 4 http://charlottenc.gov/CMPD/ResponseAreas/Pages/NORTH_TRYON.aspx
## 5 http://charlottenc.gov/CMPD/ResponseAreas/Pages/NORTH.aspx
## 6 http://charlottenc.gov/CMPD/ResponseAreas/Pages/HICKORY-GROVE.aspx
## 7 http://charlottenc.gov/CMPD/ResponseAreas/Pages/UNIVERSITY-CITY.aspx
## 8 http://charlottenc.gov/CMPD/ResponseAreas/Pages/PROVIDENCE.aspx
## 9 http://charlottenc.gov/CMPD/ResponseAreas/Pages/INDEPENDENCE.aspx
## 10 http://charlottenc.gov/CMPD/ResponseAreas/Pages/STEELE_CREEK.aspx
## CreationDa Creator EditDate
## 1 2019-02-17T10:01:34.557Z CharlotteNC 2019-02-17T10:01:34.557Z
## 2 2019-02-17T10:01:34.557Z CharlotteNC 2019-02-17T10:01:34.557Z
## 3 2019-02-17T10:01:34.557Z CharlotteNC 2019-02-17T10:01:34.557Z
## 4 2019-02-17T10:01:34.557Z CharlotteNC 2019-02-17T10:01:34.557Z
## 5 2019-02-17T10:01:34.557Z CharlotteNC 2019-02-17T10:01:34.557Z
## 6 2019-02-17T10:01:34.557Z CharlotteNC 2019-02-17T10:01:34.557Z
## 7 2019-02-17T10:01:34.557Z CharlotteNC 2019-02-17T10:01:34.557Z
## 8 2019-02-17T10:01:34.557Z CharlotteNC 2019-02-17T10:01:34.557Z
## 9 2019-02-17T10:01:34.557Z CharlotteNC 2019-02-17T10:01:34.557Z
## 10 2019-02-17T10:01:34.557Z CharlotteNC 2019-02-17T10:01:34.557Z
## Editor Shape__Are Shape__Len geometry
## 1 CharlotteNC 115205697 64869.93 MULTIPOLYGON (((-80.8396 35...
## 2 CharlotteNC 364121807 99898.26 MULTIPOLYGON (((-80.85844 3...
## 3 CharlotteNC 285751984 76383.96 MULTIPOLYGON (((-80.76912 3...
## 4 CharlotteNC 358770831 82061.78 MULTIPOLYGON (((-80.77635 3...
## 5 CharlotteNC 1885005024 857975.19 MULTIPOLYGON (((-80.86812 3...
## 6 CharlotteNC 1012122243 328497.82 MULTIPOLYGON (((-80.664 35....
## 7 CharlotteNC 1305730182 160573.00 MULTIPOLYGON (((-80.76307 3...
## 8 CharlotteNC 528681680 119915.03 MULTIPOLYGON (((-80.81106 3...
## 9 CharlotteNC 529493155 134448.53 MULTIPOLYGON (((-80.74422 3...
## 10 CharlotteNC 1663219506 478339.68 MULTIPOLYGON (((-80.94803 3...
## Name
## 1 Central
## 2 Metro
## 3 Eastway
## 4 North Tryon
## 5 North
## 6 Hickory Grove
## 7 University City
## 8 Providence
## 9 Independence
## 10 Steele Creek
Now, let’s create a chloropleth. Below is the code to create an advanced plot.
In this problem, you need to explain what each step below is doing:
mutate(): Changing the variable name back to CMPD_Dvision from DNAME.inner_join(): Inner joining a count of the data by date and division.mutate(): Creating a new variable, year, from the date.geom_sf(): This is what creates the plot based off the coordinates.scale_fill_viridis(): Changes the color scheme used to fill the graphic.labs(): Adding labels to the graphicannotation_scale(): Shows a scale for distance at the bottom left of each graphic.facet_wrap(): Creates different graphics by year.theme_bw(): Translating the file into a black and white template.theme(): (what are each of the options doing in theme()?) The position changes where the lenend goes so in this case below the graphic. The next part chnages the size of the font and makes it bold. The next four get rid of the ticks and numbers for the longitudinal and latitudinal corrdinates.ggsave(): This saves the graphic to a pdf and png file.cmpd_chloropleth <- cmpd %>%
mutate(CMPD_Division = as.character(DNAME)) %>%
inner_join(count(df, CMPD_Division, Date), by = "CMPD_Division") %>%
mutate(Year = lubridate::year(Date)) %>%
ggplot() +
geom_sf(aes(fill = n)) +
scale_fill_viridis("Traffic Stops", labels = scales::comma) +
labs(title = "CMPD Traffic stops by CMPD Division",
caption = "Source: CMPD") +
annotation_scale(location = "bl", width_hint = 0.2) +
facet_wrap(~Year) +
theme_bw() +
theme(legend.position = "bottom",
plot.title = element_text(face = "bold", size = rel(1.5)),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank())
cmpd_chloropleth
#ggsave(cmpd_chloropleth, filename = "cmpd_chloropleth.pdf",
#width = 7, height = 5, units = "in")
#ggsave(cmpd_chloropleth, filename = "cmpd_chloropleth.png",
# width = 7, height = 5, units = "in")
Go to ggextensions website. Then click Galleries to explore the different ggplot extensions. Scroll through and see if any catch your eye.
Now, select one of the ggextension libraries below and install the package (through CRAN):
Plot the related example
#install.packages('ggridges')
#install.packages('viridis')
library(ggridges)
##
## Attaching package: 'ggridges'
## The following object is masked from 'package:ggplot2':
##
## scale_discrete_manual
library(viridis)
d <- data.frame(x = rep(1:5, 3) + c(rep(0, 5), rep(0.3, 5), rep(0.6, 5)),
y = c(rep(0, 5), rep(1, 5), rep(3, 5)),
height = c(0, 1, 3, 4, 0, 1, 2, 3, 5, 4, 0, 5, 4, 4, 1))
ggplot(d, aes(x, y, height = height, group = y, fill = factor(x+y))) +
geom_ridgeline_gradient() +
scale_fill_viridis(discrete = TRUE, direction = -1, guide = "none")
ggplot(lincoln_weather, aes(x = `Mean Temperature [F]`, y = `Month`, fill = ..x..)) +
geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
scale_fill_viridis(name = "Temp. [F]", option = "C") +
labs(title = 'Temperatures in Lincoln NE in 2016')
## Picking joint bandwidth of 3.37
Now, with the same package you ran, make a plot with that package and the gapminder data. You can choose any of the data frames (i.e., years). Make sure your plot has at least six functions (e.g., ggplot() + geom_point() is two functions and dplyr functions count as well.)
library(gapminder)
glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Af…
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 148803…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.…
ggplot(gapminder, aes(x=lifeExp, y=continent, fill=..x..))+geom_density_ridges_gradient()+scale_fill_viridis(name='Life')+labs(title='Life Expectancy by Continent')+theme_bw()+theme(legend.position = "bottom",
plot.title = element_text(face = "bold", size = rel(1.5)),
axis.ticks.x=element_blank(),
axis.ticks.y=element_blank())
## Picking joint bandwidth of 2.23
Describe what you have found using that plot (write at least 3 sentences) : By using ggridges I was able to graph how continent and life expectancy are related based on the gapminder dataset. As you can see Africa is overallmfar behind the other continents and is also pretty widely dispersed. Asia is very widely dispersed which would seem to indicate a wide range of life expectancies between countries in asia. It was surprising to see Oceania have the highest life expectancy but it could be partially due to the fact that that they fewer countries.
For even more fun, plot an interactive HTML plot using the code for any of the plots above (fair warning, some of the ggextensions may not work well).
The easiest way to do this is to use the plotly package (install it with the “Packages” panel in RStudio), and then to use its ggplotly() function.
I’ve given you some commented-out code below (commented out so that R doesn’t yell at you about the code not working when you knit).
Also, check out the documentation, especially this page about customizing the tooltips that show up when you hover over points or areas.
library(plotly)
my_cool_plot <- ggplot(D3, aes(x=Date, y=n))+geom_line()+facet_wrap(~CMPD_Division)+scale_x_date(date_labels = "%Y")
my_cool_plot
ggplotly(my_cool_plot)